Methods and system for dynamic reallocation of data processing resources for efficient processing of sensor data in a distributed network

ABSTRACT

Methods and a system for dynamic reallocation of data processing resources for efficient processing of sensor data in a distributed network are provided. The methods and system include determining a data transmission cost f_(t); determining a data processing cost f_(p); determining a data storage cost f_(s); and determining a data query Q which minimizes f(f_(t)+f_(p)+f_(s)) for a system of networked data processing resources.

TRADEMARKS

IBM® is a registered trademark of International Business Machines Corporation, Armonk, N.Y., U.S.A. Other names used herein may be registered trademarks, trademarks or product names of International Business Machines Corporation or other companies.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to data networks and more particularly to efficient distributed processing of network sensor data.

2. Description of the Related Art

Recent advances in computer technology and wireless communications have enabled the emergence of stream-based sensor networks. Broad applications include network traffic monitoring, real-time financial data analysis, environmental sensing, large-scale reconnaissance, surveillance, etc. In these applications, real-time data are generated by a large number of distributed sources, such as sensors, and must be processed, filtered, interpreted or aggregated in order to provide useful services to users. The sensors, together with many other shared resources, such as Internet hosts and edge servers, are organized into a network that collectively provides a rich set of query processing services. Resource-efficient data management is a key challenge in such stream-based sensor networks.

Two different approaches are commonly used in the prior art for managing and processing data in such networks. The first approach involves setting up a centralized data warehouse (e.g., a data fusion center) to which sensors push potentially useful sensor data for storage and processing. Users can then make a variety of sophisticated queries on the data stored at this central database. Clearly, this approach does not make efficient use of resources because all data must be transmitted to the central data warehouse whether or not it is of interest. Moreover, this may require a large investment in processing resources at the warehouse while processing resources within the network are underutilized.

The second approach involves pushing queries all the way to the remote sensors. Such querying of remote sensor nodes is generally more bandwidth efficient. However, it is greatly limited by the low capability, low availability and low reliability of the edge devices such as sensors.

Thus, a third approach, which is an object of an embodiment of the present invention, pushes query processing into the network as necessary in order to reduce data transmission and better utilize available shared resources in the network. Furthermore, given that various queries are generated at different rates and only a subset of sensor data may actually be queried, caching some intermediate data objects inside the network advantageously increases query efficiency.

In addition, most distributed stream processing systems in the prior art avoid the placement problem and assume that operator locations are pre-defined, and thus are unable to adapt to varying network conditions. In-network query processing has mostly focused on the operator placement problem. At least one prior art solution considered the placement of operators so as to improve performance by balancing load where the communication bandwidth has not been taken into consideration. Another prior art solution considered queries involving only simple operations like aggregation. In this case, the communication cost dominates and it is feasible to perform all operations as close to the sensors as possible. Yet another prior art solution considered queries involving more sophisticated operations with non-negligible computational costs, and developed an operator placement algorithm for the special case in which the query graph is a sequence of operations and the sensor network is a hierarchical tree.

Other prior art solutions considered the operator placement problem for tree-structured query graphs and general sensor network topologies using simple heuristics and localized search algorithms. However, both of these prior art solutions risk never finding a good placement.

In addition, many of these prior art solutions assume that queries are generated at the same rate as the data and are applied to all data. They do not exploit the fact that some queries may be generated at lower rates such that they need only be applied to a fraction of the data generated.

It will be appreciated that there exists a need to determine the optimal network locations at which to execute specific query operations and store intermediate data objects. Intuitively, one would like to place the operators as close as possible to the edge devices (sensors) so as to reduce transmission costs. However, devices close to the edge are likely to have limited processing and storage capabilities. Thus they may not be capable of handling sophisticated queries.

It will also be appreciated that there exists a need to balance these conflicting effects so as to achieve the minimum overall cost in computation, storage and communication through an efficient dynamic reallocation of data processing resources.

Embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention. For a better understanding of the invention with advantages and features, refer to the description and to the drawings.

SUMMARY OF THE INVENTION

Key features of embodiments of the present invention utilize the limited bandwidth, computation and storage capacities of a distributed network for efficient in-network processing in anticipation of a mix of queries that are only known stochastically.

Other key features of embodiments of the present invention determine the placement of various querying operators, and also account for the randomness of the querying process. Embodiments of the invention also account for different queries with different query frequencies and that only a subset of sensor data may actually be queried. Each query is associated with a given query graph, consisting of operators and data objects, that describes how a response to the query is obtained logically. Key features of the invention integrate operator placement and caching together by determining node assignment for both operators and intermediate data objects that yield the minimum total communication, computation, and storage cost.

In accordance with one embodiment of the present invention, a method for dynamic reallocation of data processing resources for efficient processing of sensor data in a distributed network is provided. The method includes networking data objects, each data object initializing its respective data states; and each data object distributing its respective data states to at least one of the other data objects. The method also includes the at least one nearest neighbor data object receiving data states and determining a data processing cost associated with the received data states; and updating its respective data states if the data processing cost associated with the received data states is reduced.

The invention is also directed towards a system for dynamic reallocation of data processing resources for efficient processing of sensor data in a distributed network. The system includes networked data objects, each initializing and distributing its respective data states to at least one other data object. In addition, each of the networked data objects is either a relay node, a sensor node, or a fusion node. Each node includes a data transmission cost f_(t) module for cost associated with transmission of the sensor data; a data processing cost f_(p) module for cost associated with data processing of the sensor data; a data storage cost f_(s) module for cost associated with data storage of the sensor data; and a data query Q minimization module for minimizing cost associated with data transmission cost, data processing cost, and data storage cost.

Technical Effects

As a result of the summarized invention, technically we have achieved a solution which improves dynamic reallocation of data processing resources for efficient processing of sensor data in a distributed network.

In accordance with one embodiment of the invention, a program storage device readable by a machine is provided, tangibly embodying a program of instructions executable by the machine to perform a method for dynamic reallocation of data processing resources for the efficient processing of sensor data within a plurality of nearest neighbor nodes. The method includes determining a data transmission cost f_(t) of the sensor data for each of the plurality of nearest neighbor nodes; determining a data processing cost f_(p) of the sensor data for each of the plurality of nearest neighbor nodes; determining a data storage cost f_(s) of the sensor data for each of the plurality of nearest neighbor nodes; and determining a data query Q which minimizes f(f_(t)+f_(p)+f_(s)).

BRIEF DESCRIPTION OF THE DRAWINGS

The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:

FIG. 1 is a pictorial diagram of an exemplary data stream processing environment with distributed data sources that provide data streams incorporating features of the present invention;

FIG. 2 is a pictorial diagram of an exemplary data stream processing environment with distributed data sources that provide data streams incorporating features of the present invention;

FIG. 3 is a pictorial diagram of an exemplary tree data stream processing environment with distributed data sources that provide data streams incorporating features of the present invention shown in FIG. 2;

FIG. 4 is a pictorial diagram of an exemplary data stream processing environment with distributed data sources that provide multiple data streams incorporating features of the present invention shown in FIG. 2;

FIG. 5 is a pictorial diagram of another exemplary tree data stream processing environment with distributed data sources that provide data streams incorporating features of the present invention;

FIG. 6 is a pictorial representation of a three-tier sensor network incorporating features of the present invention shown in FIG. 1;

FIG. 7 is a graphical representation of two prior art solutions compared with a solution set provided by a method of the present invention;

FIG. 8 is another graphical representation of two prior art solutions compared with a solution set provided by a method of the present invention;

FIG. 9 is pseudo code which may be utilized to implement a method and system of the present invention shown in FIG. 1;

FIG. 10 is pseudo code which may be utilized to implement a method and system of the present invention shown in FIG. 1;

FIG. 11 is pseudo code which may be utilized to implement a method and system of the present invention shown in FIG. 1;

FIG. 12 is pseudo code which may be utilized to implement a method and system of the present invention shown in FIG. 1;

FIG. 13 is an enabling description of features of the present invention shown in FIG. 1;

FIG. 14 is an enabling description of features of the present invention shown in FIG. 1; and

FIG. 15 is a pictorial representation of a data processing system which may be utilized to implement a method and system of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

Referring to FIG. 1, there is shown an exemplary data stream processing environment with distributed data sources that provide data streams. The environment may consist of any suitable number of nodes 1-10 (sensors and edge servers) capable of executing multiple stream-oriented operators. These nodes 1-10, which may be geographically dispersed, are organized into a network 15 and collectively provide a rich set of query processing services for multiple concurrent stream-based applications.

Referring also to FIG. 15, there is shown a block diagram of a node incorporating features of the present invention shown in FIG. 1. It will be seen that FIG. 15 shows input devices keypad 120 and input device 150. Also shown are display 130 and antenna 110. Node 100 includes a processor 210 that is coupled to the antenna 110, keypad 120, display 130, and input device 150. Input device 150 may be any suitable physical sensor that monitors the environment and collects data. Antenna 110 may be any suitable network connection such as a local area network (LAN), a wide area network (WAN) and/or any other suitable network connection.

In addition, processor 210 is also coupled to a memory 230. Memory 230 may include any combination of volatile and non-volatile memory. Processor 210 takes input from keypad 120, antenna 110, and memory 230, and generates appropriate output on display 130, antenna 110, and/or memory 230. For clarity, the FIG. 15 diagram shows only the most commonly-known components and features that allow a complete description of the preferred embodiments of the present invention.

Still referring to FIG. 15, memory 230 also contains database 215. Database 215 may contain a data transmission cost f_(t) module for cost associated with transmission of the sensor data; a data processing cost f_(p) module for cost associated with data processing of the sensor data; a data storage cost f_(s) module for cost associated with data storage of the sensor data; and a data query Q minimization module for minimizing cost associated with data transmission cost, data processing cost, and data storage cost.

As used herein, a node includes any device for providing data output. However, one skilled in the art will recognize that the teachings herein may be used with a variety of data devices other than the node shown in FIG. 15. Accordingly, nodes such as shown in FIG. 15 are merely illustrative of certain embodiments for data processing resources.

The sensor network 15 is represented by a directed graph G_(N)(V,E), as illustrated in FIG. 1. The edges of the network 15 correspond to communication links which could be wired or wireless. Also, if (i,j) ∈ E, then (j,i) ∈ E. That is, communication is feasible in both directions for every pair of connected nodes 1-10. Also, |V|=N, and there are at least three types of nodes in the network: e.g., sensor nodes, relay nodes, and a fusion center.

Sensor nodes: Sensor nodes may be any suitable physical sensor that monitors the environment and collects data. They may or may not have the capability to process data. Sensor nodes may be, for example, a thermometer, smoke detector, or a camera. It will be appreciated that a sensor node may be any suitable device that responds to a physical stimulus such as heat, light, sound, pressure, magnetism, or motion, and transmits a resulting impulse or signal. It will also be understood that any suitable sensor node will contain the logic and resources necessary for the dynamic reallocation of data processing resources for the efficient distributed processing of sensor data.

Relay nodes: Nodes that receive and forward data. In general, relay nodes have the capability of processing data.

Fusion center: The node to which users make queries and obtain responses based on sensor data.

These nodes have varying computation and data transmission capabilities. They may also have storage space to cache data. Further, each node only has local knowledge about its neighboring nodes. Let Φ_(i) denote the set of the neighbors of node i such that for each j ∈ Φ_(i), (i,j) ∈ E or (j,i) ∈ E.
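The neighbor sets Φ_(i) can be illustrated with a minimal Python sketch. The function name `neighbor_sets`, the `links` list, and the four-node topology are hypothetical and chosen only for illustration; the sketch assumes every link is usable in both directions, as stated for G_(N).

```python
def neighbor_sets(links):
    """Build Phi_i for every node from a list of links.

    Each link (i, j) is treated as bidirectional, mirroring the
    property that (i, j) in E implies (j, i) in E.
    """
    phi = {}
    for i, j in links:
        phi.setdefault(i, set()).add(j)
        phi.setdefault(j, set()).add(i)
    return phi


# Hypothetical topology: node 2 is connected to nodes 1, 3 and 4.
links = [(1, 2), (2, 3), (2, 4)]
phi = neighbor_sets(links)
```

Each node only ever reads its own entry of `phi`, which matches the local-knowledge assumption.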

Discrete Stream-Based Data Generation

Sensor nodes monitor and collect data about the environment by sampling it periodically. Further, time may be discretized, i.e., each sensor takes one sample during a time frame. Then, a snapshot of the whole environment may be taken by the network at each time frame or any suitable time frame. Note that this does not require strict synchronization of the various nodes in the network. Each node simply needs to index the snapshots in sequence. For clarity, the following description focuses on a single snapshot. However, it will be appreciated that any suitable number of snapshots may be used.

Query Graph

Still referring to FIG. 1, the network 15 supports a set of queries. Each query Q is represented by a query graph G_(Q), consisting of operators and data objects. Such a graph can be built either directly by the user, or derived by a query planner. In general, the query graphs are logical graphs depicting the interaction between various operators and data objects. The query graphs presented herein are decoupled from the physical network topology so as to focus more on the semantics and work flow relations and better describe embodiments of the present invention.

It will be understood, however, that an underlying physical network will contain the logic and resources necessary for optimizing dynamic reallocation of data processing resources for efficient distributed processing of sensor data in accordance with embodiments of the invention presented herein.

A query graph is a directed acyclic graph (DAG) G_(Q)=(V_(o), V_(d), E) that describes how a response to the query is obtained logically. Here, V_(o) is a set of operators and V_(d) is a set of data objects. In words, a query graph consists of data objects and operators, where the operators are essentially programs which aggregate or process existing data to generate new data.

Referring to FIG. 2, there is shown a query graph. For example, Operator-7 (item 277) generates intermediate data-7 (item 27) and requires as inputs data-1 (item 21) and data-2 (item 22). Likewise, Operator-6 (item 266) generates intermediate data-6 (item 26) and requires as inputs data-3 and data-4, items 23 and 24, respectively. Operator-8 (item 288) generates final data-8 (item 28) and requires as input intermediate data-7 and data-8, items 27 and 28, respectively. Similarly, Operator-9 generates final data-9 (item 29) and requires as input intermediate data-6 (item 26) and data-5 (item 25). Let |V_(d)|=M. In a preferred embodiment the data objects are divided into three classes:

(1) Sensor data (raw data): Data generated by the sensor nodes, e.g., data-1 (item 21), data-2 (item 22), data-3 (item 23), data-4 (item 24), and data-5 (item 25) in FIG. 2. It will be appreciated that any suitable number of sensor nodes generating sensor data may be used;

(2) Final data: Answers to the query, e.g., data-8 (item 28) and data-9 (item 29) in FIG. 2. The final data need to be delivered to fusion center 201; and

(3) Intermediate data: Data that are neither sensor data nor final data, for example data-6 (item 26) or data-7 (item 27) in FIG. 2.

For purposes of description clarity, only one data object generated by each operator is shown. Thus the operator can be assigned the same label as the data generated. It will be understood that each operator may require multiple data objects as input, and each data object may be requested as input by multiple operators.

As previously mentioned, one snapshot of the environment is created at each time slot. Queries, however, may be generated at a lower rate and are therefore applied only to a fraction of the data generated. Accordingly, for a given query Q with query graph G_(Q), each snapshot will be evaluated by G_(Q) with probability q.

Note the following notations:

Δ_(k): The set of data objects in V_(d) that are needed to generate data object k. For example, Δ_(7)={1, 3} in FIG. 2. We refer to the data objects in Δ_(k) as children of data k.

D_(k): The number of bits of data k.
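The notation above can be made concrete with a small Python sketch of a query graph. The labels `s1`, `s2`, `s3`, `mid`, `out` and the bit counts are hypothetical and do not reproduce FIG. 2; the classification rule (no children means sensor data, never used as an input means final data) is an assumption that follows from the three classes described above.

```python
# Hypothetical query graph: children[k] plays the role of Delta_k.
children = {
    "s1": [],               # sensor data have no children
    "s2": [],
    "s3": [],
    "mid": ["s1", "s2"],    # intermediate data built from s1 and s2
    "out": ["mid", "s3"],   # final data built from mid and s3
}

# size_bits[k] plays the role of D_k (illustrative values only).
size_bits = {"s1": 800, "s2": 800, "s3": 400, "mid": 64, "out": 8}


def classify(children):
    """Label each data object as sensor, intermediate or final data."""
    used_as_input = {m for kids in children.values() for m in kids}
    classes = {}
    for k, kids in children.items():
        if not kids:
            classes[k] = "sensor"
        elif k not in used_as_input:
            classes[k] = "final"
        else:
            classes[k] = "intermediate"
    return classes


classes = classify(children)
```

Because each operator is labeled by the data it generates, the `children` map alone is enough to describe the operators as well.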

The sensor network G_(N) is given and a query Q (with query graph G_(Q)) is posed at the fusion center (FIG. 2, item 201). In accordance with an embodiment of the present invention a method for efficiently obtaining the final answer to the query is provided. It will be appreciated that some prior art solutions transmit all sensor data to the fusion center, and let the fusion center do the processing and aggregation. But this can be very expensive in the case that high volume data sources are involved, and the final answer could be very simple. In a preferred embodiment, one method is to allow partial processing and aggregation within the network (FIG. 1, item 15), where the data are combined or filtered in the network (FIG. 1, item 15) so as to reduce the number of transmissions.

Query Processing Scheme

A query processing scheme (Π,R,Ω) in accordance with an embodiment of the present invention solves the following three problems.

(1) Operator Placement (Π): Let Π denote the operator placement matrix such that Π(k,i)=1 if data k is generated at node i. That is, the operator generating data k is placed at node i. For clarity, assume each operator can be placed at only one node, thus Π(k,j)=0 for all j≠i, although redundant placement may be beneficial, especially when data k is requested by multiple upstream operators.

(2) Routing (R): Given a placement of operators, route the data from the node at which it is generated to the node at which it is required as input. For example, all data in Δ_(k) (children of data k) need to be transmitted to node i, as these are required input data in order to generate data k.

(3) Data Caching (Ω): Here, Ω is the caching matrix, with Ω(k,i)=1 if data k is cached at node i.

Routing and operator placement clearly impact the efficiency of the query processing. It may be less obvious why one should be concerned with data caching. Consider two simple caching schemes:

(1) PULL: Cache the raw data where they are generated. Transmit and process relevant data only when a query is made.

(2) PUSH: Transmit and process all raw data for all queries and deliver results to the fusion center.

The disadvantage of PULL is that it requires the fusion center to signal the sensor nodes each time that data is needed. This incurs a communication cost with each query; the advantage is that the data are only transmitted when a query is made.

The disadvantage of PUSH is that data are always transmitted to the fusion center. Even in the absence of a query, a communication cost is incurred. This may limit the total number of queries that can be supported. The advantage is that no additional transmission is required at the time of a query.

Thus, the choice of PULL or PUSH depends on q. If q=1, i.e., we know for sure that a query is made on each snapshot, PUSH is the best choice. Similarly, if q is zero or close to zero, PULL is best. On the other hand, when q lies between zero and one, it may be more beneficial to cache data at nodes in the middle of the network. This may produce a cost savings over that of PUSH or PULL. Thus, choosing the proper location at which to cache the (intermediate or final) data should also be considered when designing efficient query processing schemes.
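The dependence on q can be illustrated with a toy Python model. Everything in it is an assumption made for illustration and is not taken from the specification: a single data object of D bits travels along a chain of h hops from the sensor (hop 0) to the fusion center (hop h), is cached at hop c, and with probability q a signaling message of D_q bits travels from the fusion center to the cache and the data travels back. The storage cost list `r_s` and all numbers are hypothetical.

```python
def expected_cost(c, h, q, D, D_q, r_t, r_s):
    """Expected per-snapshot cost of caching at hop c on an h-hop chain."""
    push = D * r_t * c                    # always paid: sensor to cache node
    store = D * r_s[c]                    # holding cost at the cache node
    pull = q * (D_q + D) * r_t * (h - c)  # paid only when a query is made
    return push + store + pull


def best_cache_hop(h, q, D, D_q, r_t, r_s):
    """Return the hop index with the smallest expected cost."""
    return min(range(h + 1),
               key=lambda c: expected_cost(c, h, q, D, D_q, r_t, r_s))


# Storage is assumed expensive at the sensor (hop 0) and cheap elsewhere.
costly_sensor = [2.0, 0.5, 0.5, 0.5]
uniform = [0.5, 0.5, 0.5, 0.5]

always_queried = best_cache_hop(3, 1.0, 10, 1, 1, costly_sensor)
half_queried = best_cache_hop(3, 0.5, 10, 1, 1, costly_sensor)
never_queried = best_cache_hop(3, 0.0, 10, 1, 1, uniform)
```

Under these assumed numbers, q=1 selects the fusion center (PUSH), q=0 with uniform storage cost selects the sensor (PULL), and q=0.5 with costly sensor storage selects hop 1, a node in the middle of the chain. In this linear model a middle node only wins when storage costs differ between nodes.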

Cost Metrics

A query processing scheme (R,Π,Ω) is given for query Q with query graph G_(Q) on sensor network G_(N). Let f_(cost)(G_(N),G_(Q),R,Π,Ω) denote the cost of this scheme; it accounts for three different costs.

(1) Transmission cost f_(t)(G_(N),G_(Q),R,Π,Ω): The cost used to transmit the data. Let r_(t)(i,j) denote the cost to transmit one unit of data over link (i,j). In general, r_(t)(i,j)=r_(t)(j,i).

(2) Processing cost f_(p)(G_(N),G_(Q),R,Π,Ω): The cost of processing and aggregation. Let r_(p)(k,i) denote the cost to generate data k at node i. For the sensor data, r_(p)(k,i)=0 if data k is pinned to node i; and r_(p)(k,i)=∞ otherwise.

(3) Storage cost f_(s)(G_(N),G_(Q),R,Π,Ω): The cost used to store/cache the data. Let r_(s)(i) denote the cost of storing one unit of data at node i. If there is no storage space available at node i, r_(s)(i)=∞.

Thus, the total cost of scheme (R,Π,Ω) is f_(cost)(G_(N),G_(Q),R,Π,Ω)=(f_(t)+f_(p)+f_(s))(G_(N),G_(Q),R,Π,Ω).
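A minimal Python sketch of evaluating the three cost terms for one fixed scheme follows. The specification does not spell out how each term is summed, so the sums used here are assumptions: f_(t) adds D_(k)·r_(t)(i,j) over every link that data k crosses, f_(p) adds r_(p)(k,i) for each placed operator, and f_(s) adds D_(k)·r_(s)(i) for each cached data object. The query probability q is ignored, and the three-node chain and its numbers are hypothetical.

```python
import math


def total_cost(routes, placement, cache, size, r_t, r_p, r_s):
    """Return (f_t, f_p, f_s, f_cost) for one scheme (R, Pi, Omega).

    routes[k]    -- links (i, j) that data k traverses (the routing R)
    placement[k] -- node at which data k is generated (Pi)
    cache[k]     -- node at which data k is cached (Omega)
    Missing r_p or r_s entries are treated as infinite cost.
    """
    f_t = sum(size[k] * r_t[link]
              for k, links in routes.items() for link in links)
    f_p = sum(r_p.get((k, i), math.inf) for k, i in placement.items())
    f_s = sum(size[k] * r_s.get(i, math.inf) for k, i in cache.items())
    return f_t, f_p, f_s, f_t + f_p + f_s


# Hypothetical chain 0 - 1 - 2 with the fusion center at node 2.
size = {"a": 10, "b": 10, "c": 1}
r_t = {(0, 1): 1, (1, 2): 1}
r_p = {("a", 0): 0, ("b", 1): 0, ("c", 1): 2}
r_s = {2: 0.5}
routes = {"a": [(0, 1)], "b": [], "c": [(1, 2)]}
placement = {"a": 0, "b": 1, "c": 1}
cache = {"c": 2}

f_t, f_p, f_s, f_cost = total_cost(routes, placement, cache,
                                   size, r_t, r_p, r_s)
```

Treating a missing entry as infinite mirrors the conventions r_(p)(k,i)=∞ and r_(s)(i)=∞ for infeasible choices.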

Minimum Cost Optimization

An embodiment of the present invention determines the query processing scheme (R*,Π*,Ω*) that solves:

(R*,Π*,Ω*)=arg min f_(cost)(G_(N),G_(Q),R,Π,Ω).   (Problem 1)

Consider a special case where the holding cost is zero and q=1. In this case, PUSH is the best caching scheme and the optimization problem simplifies to:

(R*,Π*)=arg min f_(cost)(G_(N),G_(Q),R,Π).

Even for this case, the optimization problem is known in the prior art to be NP-complete. It is also known in the art that, when the query graph has a tree structure, the optimization problem can be solved using a centralized algorithm with complexity O(MN²). Recall that N is the number of nodes in the network and M is the number of data items in the query graph.

However, while centralized algorithms known in the art solve the optimization problem, they have disadvantages that are addressed by embodiments of the present invention. For example, the prior art requires the fusion center (FIG. 2, item 201) to know the entire topology of the network. Furthermore, when the network topology changes, say due to the failure of a node, the whole scheme needs to be re-calculated. Thus, in a preferred embodiment of the present invention the nodes (FIG. 1, items 1-10) only need to know the information of their neighbors, and the optimal scheme is obtained by exchanging information between neighboring nodes.

Referring to FIG. 3, there is shown a tree structured query graph. Distributed methods, in accordance with embodiments of the present invention, which solve the general optimization problem with complexity O(M²N³) are provided herein. Furthermore, these methods allow the optimal scheme to be re-calculated very efficiently as the network topology changes. After investigating the tree structure, heuristic solutions will be provided for general structured queries.

Tree Structured Queries

In this section, embodiments of the present invention provide solutions to the optimization problem for the case that the query has a tree structure, e.g., FIG. 3. For a first example, a special case where q=1 and the holding cost is zero is considered. Then the method, in accordance with embodiments of the present invention, is extended to the case where q ∈ (0, 1) and the holding cost is not zero.

Minimum Cost Routing and Operator Placement

First, let q=1 and let the holding cost be zero. From the description above, it will be understood that the best caching scheme is PUSH since every snapshot will be queried. Thus, (R*,Π*)=arg min f_(cost)(G_(N),G_(Q),R,Π).

During the process of solving the optimal placement in a distributed fashion, each node i maintains for each data item k the following information: Cp(k,i), Fp(k,i), Πp(k,i) and I(k,i). In particular, Cp(k,i) is the current cost of the best query processing scheme to obtain data k at node i; Fp(k,i) is the neighbor that transmits data k to node i in the current scheme; Πp(k,i)=1 if data k is generated at node i in the current scheme, and Πp(k,i)=0 otherwise; and I(k,i)=1 if the state of data k has been updated but not yet distributed to node i's neighbors, and I(k,i)=0 otherwise.

The distributed optimal placement method is given in Method 1 shown in FIG. 9. First the states of the data are initialized according to lines 1-4, and then the following two steps are executed repeatedly:

(i) Lines 6-10: If I(k,i)=1, which means data k's states at node i have been updated but not distributed to node i's neighbors, then node i sends data k's states to all its neighbors and I(k,i) is set to zero.

(ii) Lines 11-19: According to the information received from its neighbors, node i checks whether it should update the states of its data. For each data k, node i first checks whether the cost is reduced by obtaining data k from its neighbors (lines 13-16). Then, node i checks whether it is better to generate data k at node i (lines 17-19). Let I(k,i)=1 if data k's states are updated.

The method terminates when no additional information is exchanged between nodes. Note that the method in this embodiment is synchronous: nodes have to wait for the completion of step (i) to execute step (ii), and wait for the completion of step (ii) to execute step (i).
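The two synchronous steps can be sketched in Python as a Bellman-Ford-style relaxation. FIG. 9 is not reproduced in this text, so the sketch is a reconstruction from the prose above rather than the pseudo code itself. Two cost rules are assumed: obtaining data k from neighbor j costs Cp(k,j)+D_(k)·r_(t)(j,i), and generating data k at node i costs r_(p)(k,i) plus the sum of Cp(m,i) over the children m of k. The function name, argument layout, and the three-node chain are hypothetical.

```python
import math

INF = math.inf


def distributed_placement(neighbors, r_t, r_p, children, size):
    """Simulate the synchronous exchange; returns the Cp, Fp, Pp tables."""
    nodes, data = list(neighbors), list(children)
    Cp = {(k, i): INF for k in data for i in nodes}   # best known cost
    Fp = {(k, i): None for k in data for i in nodes}  # supplying neighbor
    Pp = {(k, i): 0 for k in data for i in nodes}     # 1 if generated at i
    I = {(k, i): 0 for k in data for i in nodes}      # 1 if update unsent
    # Initialization: sensor data are free at the node they are pinned to.
    for k in data:
        for i in nodes:
            if not children[k] and r_p.get((k, i), INF) == 0:
                Cp[k, i], Pp[k, i], I[k, i] = 0, 1, 1
    while any(I.values()):
        # Step (i): every node sends its updated states to all neighbors.
        inbox = {i: [] for i in nodes}
        for (k, i), flag in I.items():
            if flag:
                for j in neighbors[i]:
                    inbox[j].append((k, i, Cp[k, i]))
                I[k, i] = 0
        # Step (ii): every node checks whether a received state is cheaper,
        # then whether generating the data locally is cheaper still.
        for i in nodes:
            for k, j, cost_at_j in inbox[i]:
                cand = cost_at_j + size[k] * r_t[j, i]
                if cand < Cp[k, i]:
                    Cp[k, i], Fp[k, i], Pp[k, i], I[k, i] = cand, j, 0, 1
            for k in data:
                if children[k]:
                    cand = r_p.get((k, i), INF) + sum(
                        Cp[m, i] for m in children[k])
                    if cand < Cp[k, i]:
                        Cp[k, i], Fp[k, i], Pp[k, i], I[k, i] = cand, None, 1, 1
    return Cp, Fp, Pp


# Hypothetical chain 0 - 1 - 2; data "c" is computed from "a" and "b".
neighbors = {0: {1}, 1: {0, 2}, 2: {1}}
r_t = {(0, 1): 1, (1, 0): 1, (1, 2): 1, (2, 1): 1}
r_p = {("a", 0): 0, ("b", 1): 0, ("c", 1): 2, ("c", 2): 1}
children = {"a": [], "b": [], "c": ["a", "b"]}
size = {"a": 10, "b": 10, "c": 1}

Cp, Fp, Pp = distributed_placement(neighbors, r_t, r_p, children, size)
```

In this example the operator for "c" ends up at node 1, because shipping the 1-unit result to the fusion center is cheaper than shipping both 10-unit inputs. The simulation runs all nodes in one process for clarity; in a deployment each node would hold only its own rows of the tables.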

Method 1, shown in FIG. 9, solves the optimization problem with M(N−1) iterations. For understanding, note the following definitions:

(1) Routing path α(k,i,j): A path over which data k flows from node j to node i.

(2) Data collection tree β(k,i) rooted at node i: A tree over which data k flows to node i. If data k is the sensor data generated at sensor j, then β(k,i)=α(k,i,j). If data k is an intermediate data or the final data, and is generated at node j, then β(k,i)=α(k,i,j) ∪ (∪_(m∈Δ_(k)) β(m,j)).

(3) Depth H(β): The maximum path length from the root to a leaf. If data k is the sensor data, H(β(k,i)) is the number of hops that path β(k,i) takes; otherwise, if data k is an intermediate data or the final data, and is generated at node j,

H(β(k,i))=H(α(k,i,j))+max_(m∈Δ_(k)) H(β(m,j)).

Referring to FIG. 13, there is shown a theorem and corresponding proof that, in accordance with embodiments of the present invention, Cp₀(k,i)=Γ_(p0)(k,i), and that

Cp(k,i)=Γ_(pn)(k,i)

for data k such that Δ_(k)=∅. Then, by induction on n and on the query tree (from the leaves to the root), we conclude that equation (1) shown in FIG. 13 holds for all n, k and i. Since there are N nodes and M data objects, in the worst case,

H(β(Final Data,Fusion Center))=M(N−1).

Further, there are N nodes in the network, and at each step, each node sends at most MN messages. Thus, the number of messages exchanged is O(M²N³). Method 1, shown in FIG. 9, embeds the query graph into the sensor network (FIG. 1, item 15). Each participating node (items 1-10) knows the identities of the neighbors from which to obtain data, but not the nodes to which it delivers data. The network then executes the following trace back process.

(i) Generate message (data k*) at the fusion center, e.g., FIG. 2, item 201, where data k* is the final data.

(ii) If node i receives message (data k) from node j, it needs to transmit data k to node j whenever data k is available. Furthermore,

    (a) If Π(k,i)=1, node i sends message (data m) to node Fp(m,i) for each m ∈ Δ_(k), and node i needs to generate data k whenever data k's children are available.

    (b) If Π(k,i)≠1, node i sends message (data k) to node Fp(k,i).
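The trace back steps above can be sketched in Python as follows. The `Fp` and `Pp` tables are hard-coded, hypothetical values for a three-node chain 0 - 1 - 2 with the fusion center at node 2. One case not covered by the steps above is handled by assumption: when a child m is itself generated or pinned at node i, it is resolved locally instead of being requested from a neighbor.

```python
def trace_back(fusion, final, Fp, Pp, children):
    """Return who must deliver what to whom, and who must generate what.

    deliveries -- set of (sender, receiver, data)
    generators -- set of (node, data)
    """
    deliveries, generators = set(), set()
    pending = [(fusion, final, None)]  # (node, data, requesting node)
    while pending:
        i, k, requester = pending.pop()
        if requester is not None:
            # Node i must transmit data k to the node that asked for it.
            deliveries.add((i, requester, k))
        if Pp[k, i] == 1:
            # Case (a): node i generates k and asks for each child.
            generators.add((i, k))
            for m in children[k]:
                if Pp[m, i] == 1:
                    pending.append((i, m, None))      # child is local
                else:
                    pending.append((Fp[m, i], m, i))  # ask neighbor Fp(m,i)
        else:
            # Case (b): node i relays the request toward Fp(k,i).
            pending.append((Fp[k, i], k, i))
    return deliveries, generators


# Hypothetical tables: "c" is generated at node 1 from "a" and "b".
children = {"a": [], "b": [], "c": ["a", "b"]}
Fp = {("c", 2): 1, ("a", 1): 0}
Pp = {("c", 2): 0, ("c", 1): 1, ("a", 1): 0, ("b", 1): 1, ("a", 0): 1}

deliveries, generators = trace_back(2, "c", Fp, Pp, children)
```

After the trace, node 0 knows to send "a" to node 1, and node 1 knows to generate "c" and send it to node 2.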

It will be readily understood by those skilled in the art that each node will know how to transmit and process data according to (R*,Π*) after the trace back process terminates.

Distributed Caching

In the previous subsection, the example focused on the case where q=1, corresponding to the situation where every data snapshot is queried. In this subsection, an example of a more realistic case, i.e., where q<1, is provided; the user does not query every snapshot, but only those data in which it is interested. As described earlier, PUSH is the optimal scheme for q=1. But for q<1, caching can further reduce the cost. Now, any caching scheme must account for the fact that different nodes have different amounts of storage space. For example, a sensor node, e.g., items 21-25 shown in FIG. 2, may only have limited storage space; on the other hand, a relay node, items 7-8 shown in FIG. 2, and the fusion center, item 201, may have substantial storage. Thus, a preferred embodiment of the present invention caches data at nodes with substantial storage, and associates different holding costs with different nodes to reflect this fact. Also, when data are cached in the middle of the network, the nodes caching the data need to be notified each time a query is generated. For purposes of discussion, assume the size of the signaling message needed to do this is D_(q).

In this subsection, a distributed method that determines the minimum cost scheme is provided in accordance with embodiments of the present invention. First, each node i needs to maintain for each data k the following information: Cp (k,i), Fp (k,i), Πp (k,i), Cs (k,i), Fs (k,i), Πs (k,i), Ω(k,i) and I (k,i). In particular, Cp (k,i), Fp (k,i) and Πp (k,i) are associated with the current best processing scheme to obtain data k at node i, while Cs (k,i), Fs (k,i), Πs (k,i), and Ω(k,i) are associated with the current minimum cost scheme involving caching, where Ω(k,i) represents the caching decision, with Ω(k,i)=1 if data k is cached at node i, and 0 otherwise.

It will be understood by those skilled in the art that Γ_(pn)(k,i)+r_(p) (k,i) is the minimum cost to cache data k at node i, among all data collection trees β (k,i) such that H (β (k,i))≦n. So Cp (k,i), Fp (k,i) and Πp (k,i) are used to check whether caching data k at node i is better than pulling data k from other nodes when the query is injected.

If the final data, i.e., data from data-8, item 28 and/or data-9, item 29, shown in FIG. 2, is not pushed to the fusion center 201, the sensors or the nodes caching the data must be notified each time a query is injected. Thus, to obtain the minimum cost query processing scheme, the minimum cost path to transmit the required signaling messages is required.

Referring to FIG. 11, there is shown Method 3 for finding the minimum cost path to transmit the required signaling message. Thus, node i needs to maintain for each node j (not just for j ε Φ_(i)) the following information: R(i,j), G(i,j) and Ib (i,j). In particular, R(i,j) is the current cost to transmit one unit of data from node i to node j; G(i,j) is the next hop node i should take to reach node j; and Ib (i,j)=1 if the information has been updated, but not distributed to node i's neighbors.

Still referring to FIG. 11, the distributed method to obtain the minimum cost scheme is given as Method 3. Similar to Method 1 shown in FIG. 9, Method 3 first initializes the network states according to Method 2 shown in FIG. 10. After that, Method 3 repeats the following two steps:

-   (i) Lines 3-8: If node i's state has been updated, node i needs to send the updates to its neighbors.
-   (ii) Lines 9-31: Node i updates its state according to the information received from its neighbors. It first uses the distributed Bellman-Ford algorithm, or any suitable method, to update R(i,j), G(i,j) and Ib (i,j) (lines 10-13). Then, node i updates the cost related to caching (lines 17-19 and lines 23-25). Finally, for each data k, node i compares the current scheme with three different schemes according to the updated information received: (1) Obtain data k from node j (j ε Φ_(i)) when the query is injected (lines 20-22); (2) Compute data k at node j when the query is injected (lines 26-28); (3) Cache data k at node j (lines 29-31). If one of the three schemes is better, node i chooses that scheme and updates data k's state.
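The Bellman-Ford portion of step (ii) can be illustrated with the sketch below. It covers only the update of R(i,j) and G(i,j); the caching-related updates of Method 3 are not reproduced. The input `link_cost[i][n]`, assumed to hold the non-negative per-unit cost from node i to neighbor n, and the function names are hypothetical.

```python
import math


def bellman_ford_round(R, G, link_cost):
    """One synchronous round: each node relaxes R[i][j] through each neighbor.

    Returns the set of nodes whose state changed, playing the role of the
    Ib flag (updated but not yet distributed).
    """
    nodes = list(link_cost)
    # States as distributed by the neighbors at the start of the round.
    snapshot = {i: dict(R[i]) for i in nodes}
    updated = set()
    for i in nodes:
        for n, cost in link_cost[i].items():
            for j in nodes:
                candidate = cost + snapshot[n][j]
                if candidate < R[i][j]:
                    R[i][j] = candidate  # cheaper path from i to j found
                    G[i][j] = n          # next hop toward j
                    updated.add(i)
    return updated


def signaling_paths(link_cost):
    """Repeat rounds until no node updates its state."""
    nodes = list(link_cost)
    R = {i: {j: (0.0 if i == j else math.inf) for j in nodes} for i in nodes}
    G = {i: {j: None for j in nodes} for i in nodes}
    while bellman_ford_round(R, G, link_cost):
        pass
    return R, G


# Hypothetical three-node chain a - b - c.
link_cost = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 2.0}, "c": {"b": 2.0}}
R, G = signaling_paths(link_cost)
```

Here node a reaches node c at cost 3.0 through next hop b, so a signaling message of size Dq from a to c would cost 3.0·Dq under these assumed link costs.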

FIG. 14 shows that Method 3 shown in FIG. 11 converges to the optimal scheme within N+M(N−1) iterations.

Similar to Method 1, FIG. 9, the following trace back process is provided:

-   (i) Generate message (Data k*, Pull) at the fusion center, where data k* is the final data.
-   (ii) If node i receives message (Data k, Pull) from node j, then
    -   (a) If Πs (k,i)=1 and Ω(k,i)=1, node i sends message (data m, Push) to node Fs (m,i) for each m ε Δk, and needs to generate data k whenever data k's children are available.
    -   (b) If Πs (k,i)=1 and Ω(k,i)=0, node i sends message (data m, Pull) to node Fs (m,i) for each m ε Δk, and needs to request data m from node Fs (m,i) for each m ε Δk when there is a request for data k at node i.
    -   (c) If Πs (k,i)=0 and Ω(k,i)=1, node i sends message (data k, Push) to node Fs (k,i).
    -   (d) If Πs (k,i)=0 and Ω(k,i)=0, node i sends message (data k, Pull) to node Fs (k,i), and needs to request data k from node Fs (k,i) when there is a request for data k at node i.

If node i receives message (Data k, Push) from node j, it needs to transmit data k to node j whenever data k is available. Furthermore,

-   (a) If Πs (k,i)=1, node i sends message (data m, Push) to node Fs (m,i) for each m ε Δk, and needs to generate data k whenever data k's children are available.
-   (b) If Πs (k,i)=0, node i sends message (data k, Push) to node Fs (k,i).
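The Push/Pull cases above reduce to two independent choices: which data to ask for (the children of data k if node i computes it, otherwise data k itself) and which message type to propagate (Push if the incoming message is Push or data k is cached at node i, otherwise Pull). A minimal sketch of that decision follows; the dictionaries `pi_s`, `omega`, `f_s` and `children` stand for Πs, Ω, Fs and Δk, and the function name is hypothetical.

```python
def trace_back_messages(kind, k, i, pi_s, omega, f_s, children):
    """Return the (data, message type, destination) triples node i sends
    after receiving message (Data k, kind), where kind is "Push" or "Pull"."""
    # A Push request stays Push upstream; a Pull request becomes Push only
    # if data k is cached at node i (Ω(k,i)=1).
    cached = omega.get((k, i), 0) == 1
    out_kind = "Push" if kind == "Push" or cached else "Pull"
    # If node i computes data k (Πs(k,i)=1) it needs the children of k;
    # otherwise it needs data k itself from its next hop.
    targets = children.get(k, []) if pi_s.get((k, i), 0) == 1 else [k]
    return [(m, out_kind, f_s[(m, i)]) for m in targets]


# Hypothetical example: relay r computes and caches d3 from d1 and d2.
pi_s = {("d3", "r"): 1}
omega = {("d3", "r"): 1}
f_s = {("d3", "fc"): "r", ("d1", "r"): "s1", ("d2", "r"): "s2"}
children = {"d3": ["d1", "d2"]}

at_fc = trace_back_messages("Pull", "d3", "fc", pi_s, omega, f_s, children)
at_r = trace_back_messages("Pull", "d3", "r", pi_s, omega, f_s, children)
```

In this example the fusion center pulls d3 from the relay on demand, while the relay, which caches d3, asks both sensors to push their data as soon as it is available.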

Asynchronous Methods

Methods 1 and 3 shown in FIG. 9 and FIG. 11, respectively, execute in a synchronous way. For example, referring to FIG. 9, nodes begin to update their states (lines 11-19) only after all nodes finish distributing the updated information to their neighbors (lines 6-10); similarly, nodes begin to send out their updated information only after all nodes finish updating their states. This requires that all nodes be fully synchronized and know when other nodes finish updating and finish sending out updated information.

In this embodiment, Method 4, the methods are operated in an asynchronous way. Thus, each node updates its states right after it receives information from its neighbors, and sends updated messages to its neighbors right after its states are updated. For example, an asynchronous version of Method 1 is shown in FIG. 12, Method 4, and the asynchronous version of Method 3 can be obtained similarly. When the methods execute asynchronously, the methods are fully distributed.
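The asynchronous pattern can be illustrated on the path cost update alone. The sketch below is not Method 4 of FIG. 12; it is a simplified simulation in which messages are delivered one at a time in arbitrary order, and a node reacts to each message immediately instead of waiting for a global round. It assumes non-negative link costs and that every link is usable in both directions; `link_cost` and the function name are hypothetical.

```python
import math
import random


def async_distance_vector(link_cost, seed=0):
    """Event-driven path cost update with no global synchronization."""
    rng = random.Random(seed)
    nodes = list(link_cost)
    R = {i: {j: (0.0 if i == j else math.inf) for j in nodes} for i in nodes}
    G = {i: {j: None for j in nodes} for i in nodes}
    # Messages in flight: (sender, receiver, copy of the sender's cost vector).
    in_flight = [(i, n, dict(R[i])) for i in nodes for n in link_cost[i]]
    while in_flight:
        # Deliver an arbitrary pending message; order is not controlled.
        sender, i, vector = in_flight.pop(rng.randrange(len(in_flight)))
        changed = False
        for j, cost in vector.items():
            candidate = link_cost[i][sender] + cost
            if candidate < R[i][j]:
                R[i][j], G[i][j] = candidate, sender
                changed = True
        if changed:
            # Send the new state to the neighbors right after the update.
            in_flight.extend((i, n, dict(R[i])) for n in link_cost[i])
    return R, G


# Hypothetical three-node chain a - b - c.
link_cost = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 2.0}, "c": {"b": 2.0}}
R, G = async_distance_vector(link_cost, seed=1)
```

Because a node only ever lowers its stored costs, a stale message that arrives late is harmless, and the simulation ends when no messages remain in flight.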

Cost and Topology Change

Communication, computation, and storage costs may change over time. Furthermore, links can fail. In this section the methods shown above are modified to accommodate such changes.

An example of a change is where one of the costs decreases, e.g., node i's storage cost decreases due to the addition of storage space to node i. It will be understood that it suffices to re-initiate execution of the methods (node i first updates its states according to the decrease in cost and then distributes the updated information to its neighbors, and then all nodes begin to exchange information). The method will converge to the new optimal scheme in a finite number of steps.

Now consider the case of either a cost increase or a link failure (the cost is infinity). Simply executing the method will result in a “count to infinity” problem understood in the prior art. It may take an infinite number of iterations for the method to converge to the new optimal solution when a link fails, or B iterations to converge if the cost increase is B. Alternate methods may be used in these situations. One such method includes the following two parts:

-   (1) Freeze: When the cost of a node or a link increases, it sends out freeze messages. All data collection trees containing that node or that link are frozen except for modifying the costs. Then, the costs are re-calculated according to the cost change. For example, consider Method 1 shown in FIG. 9, and suppose Πs (k,i)=1 and node i's processing cost increases by δ. Then
    -   (i) Node i first sets Cp (k,i)=Cp(k,i)+δ. Then it checks the data h from the leaves to the root along the query tree. If Π(h,i)=1 and Cp(m,i) changes for any m ε Δh, node i re-calculates Cp(h,i). If Cp(k,i) increases for any data k, node i distributes a freeze message (Data h, Cp(k,i), Freeze) to its neighbors. Also, all states of node i are frozen, and can only be modified when freeze messages are received.
    -   (ii) If node i receives (Data k, Cp (k, j), Freeze) from node j, it first sets Cp(k,i)=Cp (k, j)+D_(k)r_(t) (j,i) if Fp (k,i)=j. Then, it checks and updates all data h's states as in (i). If any Cp (k,i) increases, node i distributes freeze message (Data k, Cp(k,i), Freeze) to its neighbors. The states of node i are frozen, and can only be modified when a freeze message is received.
-   (2) Unfreeze: When the states of all nodes are updated according to the cost change, the nodes are unfrozen. After unfreezing, those nodes whose states are updated request their neighbors to send states to them, and then all nodes begin to exchange their information using the distributed algorithms. The algorithm terminates when there is no more information exchanged among nodes. Using this Freeze-Unfreeze scheme, the new optimal scheme can be obtained in a finite number of steps.
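The Freeze part can be illustrated for a single data item as follows. This is a deliberately simplified sketch: it propagates a cost increase along fixed next-hop pointers, applying the rule Cp(k,i)=Cp(k,j)+D_(k)r_(t)(j,i) when Fp(k,i)=j, and it omits the re-calculation of parent data h and the Unfreeze part. The next-hop pointers are assumed to form a tree, and all names are hypothetical.

```python
from collections import deque


def freeze_phase(cost, next_hop, neighbors, link_cost, size, k, node, delta):
    """Propagate a cost increase of `delta` for data k at `node`.

    cost[(k, i)]       : current cost to obtain data k at node i, Cp(k,i)
    next_hop[(k, i)]   : node from which i obtains data k, Fp(k,i)
    link_cost[(j, i)]  : per-unit transmission cost from j to i, r_t(j,i)
    size[k]            : size of data k, D_k
    Returns the updated costs and the set of frozen nodes. Routing pointers
    are left untouched, which is what avoids counting to infinity.
    """
    cost = dict(cost)
    cost[(k, node)] += delta
    frozen = {node}
    queue = deque([node])
    while queue:
        j = queue.popleft()
        # Node j distributes a freeze message to its neighbors.
        for i in neighbors[j]:
            if next_hop.get((k, i)) == j:
                new_cost = cost[(k, j)] + size[k] * link_cost[(j, i)]
                if new_cost > cost[(k, i)]:
                    cost[(k, i)] = new_cost
                    frozen.add(i)
                    queue.append(i)
    return cost, frozen


# Hypothetical chain: sensor s -> relay r -> fusion center fc.
neighbors = {"s": ["r"], "r": ["s", "fc"], "fc": ["r"]}
link_cost = {("s", "r"): 2.0, ("r", "fc"): 3.0}
next_hop = {("d", "r"): "s", ("d", "fc"): "r"}
size = {"d": 1.0}
cost = {("d", "s"): 1.0, ("d", "r"): 3.0, ("d", "fc"): 6.0}
new_cost, frozen = freeze_phase(cost, next_hop, neighbors, link_cost, size, "d", "s", 5.0)
```

In the example the increase of 5.0 at the sensor reaches the relay and the fusion center in one pass each, after which the nodes could be unfrozen and the normal exchange resumed.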

Extensions

It will be appreciated that the distributed methods presented thus far solve the minimum cost query processing problem for tree-structured queries. Alternate embodiments of the present invention extend the methods to handle more general query graphs and to account for finite capacity on certain nodes.

General Queries

In this subsection, queries with more general underlying topologies are addressed. As described earlier, Problem 1 is NP-complete for the case of a general query graph. We propose a number of simple heuristics to handle this.

Heuristic 1: For a general query graph that is not a tree, such as in FIG. 2, where one data object may have more than one parent, one alternate embodiment temporarily removes some edges so that the resulting graph becomes a tree or multiple trees. The distributed methods presented thus far are then applied to each of the new trees to obtain the minimum cost scheme. Since each operator generates exactly one data object, it is only necessary to remove edges from data objects to operators, e.g., the edge from data-6, item 26, to operator 9, item 299, in FIG. 2.

Denote by π* the optimal assignment for the original problem, with total cost C*. Similarly, denote by π′ and C′ the assignment and total cost corresponding to the new trees. Then C′≦C*. This is because the optimal placement π* is clearly feasible for the new trees, and C′ does not need to include the transmission cost on the removed edges, so it can be no larger than C*. Adding back to C′ the transmission cost Δ of the removed edges yields a feasible solution to the original problem. Hence C*≦C′+Δ. Therefore, if the query graph is tree-like, such that only a limited number of edges need to be removed to obtain tree structures, Δ is small and the distributed method yields a solution close to the optimum.
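The two inequalities above can be combined into a single bound on how far the solution of Heuristic 1, whose cost on the original graph is C′+Δ, can be from the optimum C*. The symbols are those already defined in this subsection.

```latex
C' \le C^{*} \le C' + \Delta
\quad\Longrightarrow\quad
0 \le (C' + \Delta) - C^{*} \le \Delta
```

In other words, the cost of the heuristic solution exceeds the optimal cost by at most the transmission cost Δ of the removed edges, and C′ itself serves as a lower bound on C*.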

Heuristic 2: There exists one set of queries for which we can easily adapt our earlier methods for trees to obtain the optimal solution. The graph shown in FIG. 2 is transformed to the graph in FIG. 4 by replacing the edges from data-6, item 26, to operator 8, item 288, and operator 9, item 299, by a subgraph consisting of three nodes: copy operator 41 and two data items, 6a, item 26A, and 6b, item 26B. Edges are added from data-6, item 26, to the copy operator 41, from the copy operator 41 to data items 6a, item 26A, and 6b, item 26B, from 6a, item 26A, to operator 8, item 288, and from 6b, item 26B, to operator 9, item 299. A computation cost of zero is associated with the copy operator 41.

Next we reverse the edges from operator 9, item 299, to data-9, item 29, from copy operator 41 to data 6b, item 26B, and from 6b to operator 9, item 299. The resulting graph then has the tree structure shown in FIG. 5. Now apply Method 1 shown in FIG. 9 with the following modification. During tracing and deciding whether or not to add a physical link, say (i,j), to a path corresponding to an edge within the query graph that has been reversed, use a communication cost of r_(t)(j,i)D_(k) rather than r_(t)(i,j)D_(k). This accounts for the fact that the data will actually travel in the reverse direction. In the case of Method 2, shown in FIG. 10, the modification is similar; in fact, it corresponds to switching the signaling costs with the data costs for the edges of the query graph that have been reversed, as well as the directions of the links traversed, i.e., r_(t) (j, i)D_(k) rather than r_(t) (i, j)D_(k). Last, in both cases the back tracking phase must be modified to account for the reversed edges properly.

Note that the transformation that was described earlier is not special to the query graph illustrated in FIG. 2. It can be applied to any suitable query graph whose maximum node out degree is two. The resulting scheme is optimal over the class of all schemes that do not allow operators to be replicated. This approach also applies to query graphs with higher out degree. Consider, for example, a query graph that has a data object with out degree three. In this example there are three different orderings, each of which requires two copy operators. In order to obtain the optimal solution, each combination has to be treated separately and the one achieving the minimum cost identified after all three instances have been solved. It will be understood that as the out degree grows, the computational requirements grow exponentially.
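The copy-operator step of Heuristic 2 can be sketched as a small graph rewrite. The sketch covers only the insertion of the copy operator for a data object with out degree two; the edge reversal and the modified cost rule are not shown. The edge-list representation, the node labels and the function name are hypothetical.

```python
def insert_copy_operator(edges, data, copy_op="copy"):
    """Replace the two edges leaving `data` by a zero-cost copy subgraph.

    edges is a list of directed (source, destination) pairs of the query
    graph. The data object must feed exactly two operators; higher out
    degrees need several copy operators and are not handled here.
    """
    consumers = [dst for src, dst in edges if src == data]
    if len(consumers) != 2:
        raise ValueError("expected a data object with out degree two")
    # Keep every edge that does not leave the shared data object.
    new_edges = [(src, dst) for src, dst in edges if src != data]
    new_edges.append((data, copy_op))
    for suffix, operator in zip("ab", consumers):
        replica = f"{data}{suffix}"
        new_edges.append((copy_op, replica))   # copy operator -> replica
        new_edges.append((replica, operator))  # replica -> original consumer
    return new_edges


# Hypothetical fragment of the query graph of FIG. 2.
edges = [
    ("data-6", "operator-8"),
    ("data-6", "operator-9"),
    ("operator-8", "data-8"),
    ("operator-9", "data-9"),
]
tree_like = insert_copy_operator(edges, "data-6")
```

After the rewrite, data-6 has a single outgoing edge to the copy operator, and each replica (data-6a, data-6b) feeds exactly one of the original operators.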

Finite Capacity Constraints

So far we have assumed that there are no constraints on the computation capacity and storage space. However, it will be appreciated that in some scenarios there may be constraints on the total amount of capacity available at certain nodes. In this section, we provide some heuristics to address the problem.

In accordance with one embodiment of the present invention, the method increases the cost coefficients at the nodes that are severely capacity constrained. Consider, for example, a sensor node k that has little space for storage. Making the cost of storing a unit of data there extremely high will discourage the distributed methods presented herein from storing a data object at node k.

Another solution is to perform a form of distributed bin packing. Suppose, for example, that after an initial run of the distributed method, it is found that the assignment violates the capacity constraint associated with node k. An embodiment of the distributed method in accordance with the present invention can then take out the minimum number of assigned items needed to meet the capacity constraint and re-allocate them to nearby nodes that have residual capacity.
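One possible greedy form of this re-allocation is sketched below. It is an assumption-laden illustration rather than the embodiment itself: items are evicted largest first, which keeps the number of evicted items small, and each evicted item goes to the closest node with enough residual capacity. The data layout and all names are hypothetical.

```python
def rebalance(node, assigned, capacity, neighbors_by_distance):
    """Move items off an overloaded node until its capacity is respected.

    assigned[n]              : dict mapping item -> size for items at node n
    capacity[n]              : capacity of node n
    neighbors_by_distance[n] : candidate nodes, nearest first
    Mutates `assigned` and returns the list of (item, new node) moves.
    """
    load = sum(assigned[node].values())
    moves = []
    # Evicting the largest items first removes the excess with few moves.
    for item, item_size in sorted(assigned[node].items(), key=lambda kv: -kv[1]):
        if load <= capacity[node]:
            break
        for other in neighbors_by_distance[node]:
            residual = capacity[other] - sum(assigned[other].values())
            if item_size <= residual:
                del assigned[node][item]
                assigned[other][item] = item_size
                load -= item_size
                moves.append((item, other))
                break
    return moves


# Hypothetical example: node k holds 14 units but has capacity 10.
capacity = {"k": 10, "r1": 8, "r2": 20}
assigned = {"k": {"a": 6, "b": 5, "c": 3}, "r1": {"x": 4}, "r2": {}}
neighbors_by_distance = {"k": ["r1", "r2"]}
moves = rebalance("k", assigned, capacity, neighbors_by_distance)
```

Item a does not fit in the residual capacity of the nearest node r1, so it is moved to r2, and a single move is enough to bring node k within its capacity.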

A third approach in accordance with embodiments of the present invention is to introduce cost functions so that the cost is extremely high or infinite whenever the full capacity of a given node is reached. Thus, dynamic network rate control is established. It will be appreciated that the cost functions may be linear or non-linear.
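As one illustration of such a cost function, the sketch below uses a non-linear barrier that equals the base cost on an empty node, grows as the node fills, and becomes infinite at full capacity. The specific functional form and the names are assumptions chosen for illustration; the text above does not prescribe them.

```python
import math


def storage_cost(base_cost, load, capacity):
    """Per-unit storage cost that blows up as the node approaches capacity.

    base_cost : cost per unit of data on an empty node
    load      : amount of data already stored at the node
    capacity  : total storage capacity of the node
    """
    if load >= capacity:
        return math.inf  # a full node can never be selected
    return base_cost / (1.0 - load / capacity)
```

With a base cost of 2.0 and a capacity of 10, the cost is 2.0 on an empty node, 4.0 at half load and infinite at full load, so a minimum cost method that consumes these values steers data away from nearly full nodes.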

The following sections highlight the advantages of the embodiments of the invention described above. A first comparison highlights tree-structured queries and shows that the distributed method in accordance with embodiments of the invention presented herein can significantly reduce the cost of query processing.

Referring to FIG. 6, there is shown a three-tier sensor network 60. The highest tier is the fusion center 61, and the middle tier is formed by relay nodes 62. Each relay node is responsible for several sensor nodes. The sensor nodes form the lowest tier 63. In simulations, the number of relay nodes and the number of sensors that connect to the same relay node are randomly chosen from [0,L]. L is varied from 5 to 20 to investigate the performance of the algorithms for networks of different sizes.

Three-tier tree-structured query graphs are randomly generated. Each sensor data has probability ½ to be involved in the query, so there are approximately L²/2 sensor data in each query graph. To investigate general queries, two non-overlapping tree-structured queries are first generated, and then 10% additional links are randomly added to connect the two trees. The result is a tree-like query graph, and Heuristic 1 is used to obtain the placement solution.

The communication cost per unit data of each link is randomly chosen from [0, 10], and the computation cost of each operator and the storage cost per unit data are chosen so that the nodes in the higher tier have lower cost, and are approximately of the order of 1/10 of the communication cost. Embodiments of the invention are compared with the following two algorithms:

-   1. Simple-Push: The sensors transmit the data to the relay nodes after the data are generated. The relay node will process the data if possible and then transmit to the fusion center.
-   2. Simple-Pull: The fusion center signals the sensor nodes when the query is injected. Then the data are transmitted and processed as in Simple-Push.

L is chosen from 5 to 20, and for each fixed L the simulation is executed 100 times; each time, a sensor network and a tree-structured query are randomly generated. Referring to FIG. 7, there is shown the average costs of the two algorithms described above (items 71 and 72). Further, the cost of Method 3 (item 73) of an embodiment of the present invention is approximately ½ of the cost of Simple-Pull (item 72) and ⅓ of the cost of Simple-Push (item 71).

Referring now to FIG. 8, there is shown Heuristic 1 as described above (item 83) compared with Simple-Pull 82, Simple-Push 81, and a lower bound 84 of the query processing cost as discussed earlier. Observe that Heuristic 1 (item 83) is very close to the optimal solution.

The capabilities of the present invention can be implemented in software, firmware, hardware or some combination thereof.

As one example, one or more aspects of the present invention can be included in an article of manufacture (e.g., one or more computer program products) having, for instance, computer usable media. The media has embodied therein, for instance, computer readable program code means for providing and facilitating the capabilities of the present invention. The article of manufacture can be included as a part of a computer system or sold separately.

Additionally, at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform the capabilities of the present invention can be provided.

The diagrams depicted herein are just examples. There may be many variations to these diagrams described therein without departing from the spirit of the invention. The flow diagrams depicted herein are also just examples. There may be many variations to these diagrams or the steps (or operations) described therein without departing from the spirit of the invention. For instance, the steps may be performed in a differing order, or steps may be added, deleted or modified. All of these variations are considered a part of the claimed invention.

While the preferred embodiment to the invention has been described, it will be understood that those skilled in the art, both now and in the future, may make various improvements and enhancements which fall within the scope of the claims which follow. For example, any suitable IP protocol may be used, such as, for example, IPv4 or IPv6. In addition, the previously described WECM or any suitable connection manager may be used. These claims should be construed to maintain the proper protection for the invention first described.

1. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform a method for dynamic reallocation of data processing resources for efficient distributed processing of sensor data within a plurality of nearest neighbor nodes, the method comprising: determining a data transmission cost f_(t) of the sensor data for each of the plurality of nearest neighbor nodes; determining a data processing cost f_(p) of the sensor data for each of the plurality of nearest neighbor nodes; determining a data storage cost f_(s) of the sensor data for each of the plurality of nearest neighbor nodes; and determining a data query Q which minimizes f(f_(t)+f_(p)+f_(s)).
2. The program storage device readable by a machine as in claim 1, tangibly embodying a program of instructions executable by the machine to perform the method for dynamic reallocation of data processing resources for efficient distributed processing of sensor data within the plurality of nearest neighbor nodes, wherein determining the data query Q which minimizes f(f_(t)+f_(p)+f_(s)) further comprises: determining an operator placement matrix; determining a node routing matrix; and determining a data caching matrix.
3. The program storage device readable by a machine as in claim 1, tangibly embodying a program of instructions executable by the machine to perform the method for dynamic reallocation of data processing resources for efficient distributed processing of sensor data within the plurality of nearest neighbor nodes, wherein determining the operator placement matrix further comprises assigning a unit value to each of the plurality of nearest neighbor nodes generating sensor data.
4. The program storage device readable by a machine as in claim 1, tangibly embodying a program of instructions executable by the machine to perform the method for dynamic reallocation of data processing resources for efficient distributed processing of sensor data within the plurality of nearest neighbor nodes, wherein determining the node routing matrix further comprises: determining at least one generating node within the plurality of nearest neighbor nodes; and determining at least one receiving node within the plurality of nearest neighbor nodes.
5. The program storage device readable by a machine as in claim 1, tangibly embodying a program of instructions executable by the machine to perform the method for dynamic reallocation of data processing resources for efficient distributed processing of sensor data within the plurality of nearest neighbor nodes, wherein determining the data caching matrix further comprises: determining at least one intermediate data caching matrix; and determining at least one final data caching matrix.
6. A method for dynamic reallocation of data processing resources for efficient processing of sensor data in a distributed network, the method comprising: networking a plurality of data objects each of the plurality of data objects initializing its respective data states; and each of the plurality of data objects distributing its respective data states to at least one of the plurality of data objects.
7. A method as in claim 6 wherein each of the plurality of data objects initializing its respective data states further comprises each of the plurality of data objects updating its respective data states.
8. A method as in claim 7 wherein each of the plurality of data objects distributing its respective data states to at least one of the plurality of data objects further comprises each of the plurality of data objects distributing its respective data states to at least one nearest neighbor data object.
9. A method as in claim 8 wherein each of the plurality of data objects distributing its respective data states to at least one nearest neighbor data object further comprises each of the plurality of data objects distributing its respective data states to at least one nearest neighbor data synchronously.
10. A method as in claim 8 wherein each of the plurality of data objects distributing its respective data states to at least one nearest neighbor data object further comprises each of the plurality of data objects distributing its respective data states to at least one nearest neighbor data asynchronously.
11. A method as in claim 8 wherein each of the plurality of data objects distributing its respective data states to the at least one nearest neighbor data object further comprises: the at least one nearest neighbor data object receiving data states determining a data processing cost associated with the received data states; and the at least one nearest neighbor data object receiving data states updating its respective data states if the data processing cost associated with the received data states is reduced.
12. A method as in claim 11 wherein determining the data processing cost associated with received data states further comprises: determining a data transmission cost f_(t); determining a data processing cost f_(p); determining a data storage cost f_(s); and determining a data query Q which minimizes f(f_(t)+f_(p)+f_(s)).
13. A method as in claim 11 wherein determining the data query Q which minimizes f(f_(t)+f_(p)+f_(s)) further comprises: determining an operator placement matrix; determining a node routing matrix; and determining a data caching matrix.
14. A method as in claim 6 wherein networking the plurality of data objects further comprises: networking at least one sensor node; networking at least one relay node; and networking at least one data fusion center.
15. A method as in claim 6 wherein networking the plurality of data objects further comprises wireless networking the plurality of data objects.
16. A system for dynamic reallocation of data processing resources for efficient processing of sensor data in a distributed network, the system comprising: a plurality of networked data objects initializing and distributing its respective data states to at least one of the plurality of data objects, wherein each of the plurality of networked data objects comprises a node selected from the group consisting of a sensor node, a relay node and a data fusion center, wherein each of the plurality of networked data objects comprise: a data transmission cost f_(t) module for cost associated with transmission of the sensor data; a data processing cost f_(p) module for cost associated with data processing of the sensor data; a data storage cost f_(s) module for cost associated with data storage of the sensor data; and a data query Q minimization module for minimizing cost associated with data transmission cost, data processing cost, and data storage cost, wherein the data query minimization module comprises: an operator placement matrix; a node routing matrix; a data caching matrix, wherein the data caching matrix comprises: at least one intermediate data caching matrix; and at least one final data caching matrix.